WEBVTT 00:00.000 --> 00:04.520 speaking on the question that I think is actually the best-named title of ShmooCon. 00:04.520 --> 00:11.360 Who's got time for that? Looking for advisories and blogs for threat data. 00:12.160 --> 00:16.040 Thanks for that introduction and for the shout out to the title. 00:16.040 --> 00:22.480 Hi everyone, I'm Shimon. I'm Elvis Hovor. And we are both security researchers 00:22.480 --> 00:28.280 working for Accenture Technology Labs. It's the R&D arm of Accenture, if 00:28.280 --> 00:35.200 you've heard of the large consulting firm. And who are we? So both of us are 00:35.200 --> 00:39.440 security researchers. We also have a research team that's focused on natural 00:39.440 --> 00:43.880 language processing. The core mission of Accenture Technology Labs as a whole is 00:43.880 --> 00:50.360 to address essentially challenging problems within certain domains. Elvis 00:50.360 --> 00:54.440 and I focus on security and because of the wealth of expertise there is, we also 00:54.440 --> 00:59.200 do a lot of cross-domain research. So we're essentially working on different 00:59.200 --> 01:03.760 research projects and seeing if we can address any real-world problems. So one 01:03.760 --> 01:07.840 of the things that we have been focused on over the last few months is the 01:07.840 --> 01:12.880 notion of how do you operationalize threat data. And I will not use the word 01:12.880 --> 01:17.040 threat intelligence because one thing I want to stay away from is at least 01:17.040 --> 01:21.120 giving any false notion that we are trying to tackle the large problem of 01:21.120 --> 01:25.080 threat intelligence. We're still focused on threat data. And what we 01:25.080 --> 01:29.880 want to talk about is the techniques that we are researching to bring to 01:29.880 --> 01:35.600 bear, from the NLP side, the natural language processing side, 01:35.600 --> 01:45.320 on threat data extraction within the security domain. Right, so I just want to 01:45.320 --> 01:49.280 create a frame of reference around the problems that we want to talk about. And 01:49.280 --> 01:54.720 when we talk about the challenge space, again, the focus is on threat data itself. 01:54.720 --> 01:57.640 Right, when we start talking about threat intelligence, there are a lot of connotations. 01:57.640 --> 02:02.120 We're not talking about advisories, we're not talking about indicators, but the actual 02:02.120 --> 02:06.520 data that... How many of you are threat analysts in the room here or spend a lot 02:06.520 --> 02:11.400 of time looking at threat data? Right, it's you folks who are turning the data 02:11.400 --> 02:15.920 into intelligence. What we really want to focus on here is how do we 02:15.920 --> 02:19.840 put the right kind of tools in place that can make your life easier, make you more 02:19.840 --> 02:24.120 efficient. The human is not going out of the loop. The human will stay in the loop 02:24.120 --> 02:28.960 no matter what we say. So let's try and address challenges that humans are 02:28.960 --> 02:33.360 facing today. What are the kind of desired functionalities, right, as we 02:33.360 --> 02:37.520 started looking at the challenge space? There is some low-hanging fruit and 02:37.520 --> 02:43.520 there are some pie in the sky, blue sky ideas. What is the entire spectrum of 02:43.520 --> 02:47.000 those ideas and which are the ones that we picked that we felt were at least 02:47.000 --> 02:52.720 addressable in the near term?
The framework behind how we are 02:52.720 --> 02:57.360 cross-pollinating ideas from natural language processing and information 02:57.360 --> 03:02.920 extraction into the security domain itself. I particularly haven't come 03:02.920 --> 03:05.960 across too much research within the space and I'll be interested in 03:05.960 --> 03:10.880 hearing more about that at the end, if there are any ideas or if you've seen 03:10.880 --> 03:18.040 other research work that's similar. The methodology and results. So this is an 03:18.040 --> 03:22.480 experimental framework. Natural language processing; underlying that is machine 03:22.480 --> 03:27.240 learning. It is a continuous iterative process in terms of improving, 03:27.240 --> 03:31.960 leveraging the human intelligence and human expertise to make the machine part 03:31.960 --> 03:36.120 more intelligent. So it is an iterative process. We are going through 03:36.120 --> 03:42.800 this in terms of it is a work in progress, right, so the results that you see are 03:42.800 --> 03:48.800 real results and some of them good, some of them not so hot. So we'll be 03:48.800 --> 03:53.640 pointing out the kind of challenges that we are seeing. Don't want to just talk 03:53.640 --> 03:58.120 about this. This is not supposed to be a pedantic presentation. So what is it 03:58.120 --> 04:03.320 that we have put together that really brings this idea to life and what do we 04:03.320 --> 04:09.200 see going forward in terms of how do you make these improvements, and again would 04:09.200 --> 04:14.900 love to hear feedback and opinions around what can be done based on the 04:14.900 --> 04:21.720 kind of things that you're seeing in your day-to-day lives. All right, so as 04:21.720 --> 04:25.920 you read through this, I'm not going to drain the slide, this should 04:25.920 --> 04:31.440 resonate with those of you who do spend even a little bit of time trying to 04:31.440 --> 04:36.440 study or trying to look through threat data on a day-to-day basis. To me 04:36.440 --> 04:43.360 it's the 80-20 problem of human analysis. Instead of spending 20% of your 04:43.360 --> 04:49.120 effort to get 80% of the results, it's kind of the other way around, and correct 04:49.120 --> 04:54.520 me if I'm wrong, right, how many of you spend a lot of time pre-processing the 04:54.520 --> 04:58.560 advisories, the blogs, the data to actually put it into some kind of a 04:58.560 --> 05:03.800 usable format? Those are essentially pre-processing techniques that can be 05:03.800 --> 05:10.520 easily replicated using computing technology, right. So what we 05:10.520 --> 05:16.480 want to try and do is address the core issue of 05:16.480 --> 05:21.000 leveraging humans to do things that they are actually good at and not wasting 05:21.000 --> 05:27.200 their time on techniques or processes that can actually be automated. Humans 05:27.200 --> 05:32.440 don't scale, right. There are only so many experts and analysts that 05:32.440 --> 05:36.480 we have. Some of the largest threat teams that I've seen are three or four 05:36.480 --> 05:41.920 people. If they are spending most of their time trying to read through documents 05:41.920 --> 05:47.200 then things are bound to fall through the cracks. Again, relying on human memory to 05:47.200 --> 05:54.200 be able to detect or remember and correlate insights across several 05:54.200 --> 06:01.080 documents is another massive waste of time and energy and effort on 06:01.080 --> 06:06.600 behalf of these analysts.
How many of you use spreadsheets to convert data from 06:06.600 --> 06:13.160 PDFs, text documents, blogs and use that as a continual repository? No 06:13.160 --> 06:16.480 one? All right, well that's good. It seems like the state of the art has 06:16.480 --> 06:22.640 moved on. But it is the status quo right now. We are still focused on using 06:22.640 --> 06:27.040 humans where they are not efficient, or rather they shouldn't be doing some 06:27.040 --> 06:33.760 of those things. So how do we use computing technology to move the dial 06:33.760 --> 06:41.440 towards that 80-20 rule, right. And as you look through this, these are the 06:41.440 --> 06:46.920 kind of insights that we came across as we spoke to teams that we support and as 06:46.920 --> 06:54.280 security researchers ourselves: what can we do with threat data, or at 06:54.280 --> 06:57.640 least the problems related to threat data, that has been solved in other 06:57.640 --> 07:02.480 domains? Natural language processing was one of the first things that came 07:02.480 --> 07:07.960 to mind. Being able to automate the parsing, or even semi-automate the 07:07.960 --> 07:13.840 parsing, of data into some kind of output that is semi-structured. That 07:13.840 --> 07:20.600 itself gets you much closer to being able to use at least some semi- 07:20.600 --> 07:24.840 automated techniques for analysis of the data. Being able to share that data, right. 07:24.840 --> 07:29.400 You don't want to still go through the process of picking up the phone or sending an email, 07:29.400 --> 07:34.040 which is again in some kind of an unstructured format, expecting the 07:34.040 --> 07:38.320 person on the receiving end to parse through it and understand what that 07:38.320 --> 07:43.600 format means. So getting to a common understanding is something that we can 07:43.600 --> 07:50.560 get to using natural language processing techniques. So as we 07:50.560 --> 07:56.240 started talking to the natural language processing experts that we have 07:56.240 --> 08:01.920 within Accenture Technology Labs it became clear that they have 08:01.920 --> 08:07.600 been working on addressing this problem in other domains. So we figured how do we 08:07.600 --> 08:13.480 bring security expertise, layer that with natural language processing, and try and 08:13.480 --> 08:17.680 at least address this problem. We're not going to find the 08:17.680 --> 08:21.680 silver bullet here. That was never the intent, but even if we can 08:21.680 --> 08:25.800 increase the efficiency of human analysts by even 10% I think we are 08:25.800 --> 08:29.240 taking a step in the right direction. 08:31.480 --> 08:37.560 So again I'm not going to drain the slide. It's a pretty simple overview of 08:37.560 --> 08:44.560 the framework that we are using to create this solution for a research 08:44.560 --> 08:48.960 project, but there are a couple of things that I do want to focus on. First one is 08:48.960 --> 08:52.840 the information extraction module. I'll be going through exactly why we picked a 08:52.840 --> 08:56.440 specific application within natural language processing, which is information 08:56.440 --> 09:02.800 extraction. The other is standardization. One of the objectives that we 09:02.800 --> 09:08.320 set out to achieve was converting this unstructured or semi-structured data into 09:08.320 --> 09:14.160 a common understanding or commonly understood format. Why STIX?
Well I'm not 09:14.160 --> 09:18.560 going to go into the justification and rationale behind STIX itself. 09:18.560 --> 09:24.760 It's a standardization effort. Is Sean Barnum here? No. He had a 09:24.760 --> 09:29.800 great talk last year about how standardization efforts help with 09:29.800 --> 09:35.600 sharing insights. The guys at MITRE are doing some really great work. So again 09:35.600 --> 09:39.720 the notion was we need to get to a standardized format, and we went with 09:39.720 --> 09:47.160 STIX. We also, instead of storing the data in just a traditional 09:47.160 --> 09:53.200 relational database, decided to go with a graph database, and as 09:53.200 --> 09:58.960 you look at how threat data itself is represented, it is multi-dimensional. 09:58.960 --> 10:04.520 There are lots of relationships, and being able to maintain these relationships 10:04.520 --> 10:10.360 in a relational database, we found out very quickly, was going to be a task 10:10.360 --> 10:16.240 not worth pursuing. Graph databases lend themselves really well 10:16.240 --> 10:22.680 to representing these hyperdimensional relations between the data 10:22.680 --> 10:29.720 itself. So being able to quickly recall the degrees of separation between 10:29.720 --> 10:33.960 different pieces of threat data and the documents that they're 10:33.960 --> 10:38.320 assigned to, those were the kind of things that we wanted to represent and 10:38.320 --> 10:42.600 be able to do it quickly. And the third module that I really want to focus on is 10:42.600 --> 10:47.080 the learning part. As I mentioned at the start, this is a machine learning 10:47.080 --> 10:52.040 technique at the end of the day. There needs to be a way of continually 10:52.040 --> 10:59.160 sending feedback into the system. So we wanted to give the analysts, the 10:59.160 --> 11:04.120 experts, the ability to either say yes, a certain pattern is good, or a certain 11:04.120 --> 11:10.080 pattern is not good. This is not a perfect system, so we want to use the 11:10.080 --> 11:15.600 effort that analysts are already putting into analyzing this data and help the 11:15.600 --> 11:23.200 system improve over time. NLP 101, just a quick primer. How many of you are 11:23.200 --> 11:26.120 familiar with natural language processing techniques or have done 11:26.120 --> 11:30.360 some kind of research on it? So there's a few of you here. The primary 11:30.360 --> 11:37.440 purpose is to replicate how humans process language and be able to replicate 11:37.440 --> 11:41.760 the linguistic analysis techniques. There are several different applications of 11:41.760 --> 11:44.520 natural language processing. Natural language processing itself is kind of a 11:44.520 --> 11:49.720 catch-all term. Within the field itself there are several very 11:49.720 --> 11:53.680 specific applications and outputs that you get from these different 11:53.680 --> 11:58.640 applications. We focused on information extraction, and the idea behind 11:58.640 --> 12:03.160 information extraction is to be able to automate the task of identifying, 12:03.160 --> 12:08.280 collecting, and normalizing the relevant information into a structured output. 12:08.280 --> 12:13.840 So information extraction is more than just keywords. It's more than just 12:13.840 --> 12:17.680 string processing, although there are some of those elements involved in 12:17.680 --> 12:22.800 there.
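To make "normalizing the relevant information into a structured output" concrete, here is a minimal sketch of that idea; the regexes and field names below are illustrative stand-ins, not the project's actual patterns or its STIX schema.

```python
import re

# Toy information extraction: pull a few indicator types out of an advisory
# sentence and emit a STIX-like structured record. The regexes and field
# names are illustrative only, not the project's real patterns or schema.
IOC_PATTERNS = {
    "ipv4": re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b"),
    "md5": re.compile(r"\b[a-fA-F0-9]{32}\b"),
    "domain": re.compile(r"\b[a-z0-9-]+\.(?:com|net|org|ru)\b", re.I),
}

def extract_record(sentence: str) -> dict:
    """Return a structured record of indicators found in one sentence."""
    indicators = {name: pat.findall(sentence) for name, pat in IOC_PATTERNS.items()}
    return {
        "source_sentence": sentence,
        "observables": {k: v for k, v in indicators.items() if v},
    }

print(extract_record(
    "The malware beacons to evil-example.com at 203.0.113.7 "
    "and drops a file with MD5 d41d8cd98f00b204e9800998ecf8427e."
))
```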
But we are not going to the point or to the extent of fully understanding 12:22.800 --> 12:29.000 the text the way humans have the capability of doing. So these 12:29.000 --> 12:33.840 applications are very domain specific. Information extraction works well if you 12:33.840 --> 12:38.920 can represent the human expertise within a specific domain the right way. So that 12:38.920 --> 12:47.000 was where a lot of our time and effort was spent. Grade school, anyone? 12:47.000 --> 12:52.560 This is just a quick review of the linguistic analysis that we go through. 12:52.560 --> 12:57.680 Humans do this really easily, very intuitively. Trying to have a 12:57.680 --> 13:01.680 machine do this is an extremely difficult challenge. It is not a 13:01.680 --> 13:06.200 trivial challenge. So as we go through the increasing levels of 13:06.200 --> 13:11.600 analysis, we looked at doing morphological analysis, which is breaking 13:11.600 --> 13:17.360 the words, or a complex word, into its basic form. So if you take for example 13:17.360 --> 13:21.400 the sentence, "The three buffer overflow vulnerabilities could allow remote 13:21.400 --> 13:25.560 code execution," when you look at the word overflow, 13:25.560 --> 13:30.320 break it down into over and flow. Try and get to the basic units. Understanding the 13:30.320 --> 13:34.080 syntax, so are the words correctly placed and what kind of meaning are you getting 13:34.080 --> 13:39.120 out of it? And the third level of analysis, the semantic analysis itself, 13:39.120 --> 13:43.960 which is really understanding what does a word mean within the context of the 13:43.960 --> 13:47.280 sentence. So if you look at the word buffer, there are several meanings. There's the 13:47.280 --> 13:52.880 noun form, the verb form. Within the context of computation, it means 13:52.880 --> 13:58.240 storage, right? It's storage. So being able to get to the right meaning of the word 13:58.240 --> 14:05.120 is extremely important. So as we started looking at and working with our 14:05.120 --> 14:08.720 natural language processing experts, these were the kind of things that we 14:08.720 --> 14:15.080 really focused on. So just a high-level summary of the approach. Now here's where we 14:15.080 --> 14:21.080 really dive into the kind of framework that we set up for extracting the data, 14:21.080 --> 14:29.200 and also we'll talk about the different algorithms that we used for 14:29.200 --> 14:36.040 analyzing the data. So it was a six-step process. The first step, just break the 14:36.040 --> 14:42.040 entire text into its sentences, right? So sentence segmentation. The second step, 14:42.040 --> 14:47.080 which was a core part of our work, was to generate patterns related to each of the 14:47.080 --> 14:53.600 STIX data constructs. So when you look at IOCs, TTPs, courses of action, we 14:53.600 --> 15:00.760 generated base patterns that would be used for the analysis in later stages. 15:00.760 --> 15:05.480 Step three was stemming. This is again an NLP technique to reduce the 15:05.480 --> 15:12.440 complexity in the actual language itself. So stemming essentially boils or 15:12.440 --> 15:17.440 distills different forms of a word down to its basic form. Think hacker, hackers, 15:17.440 --> 15:22.040 hacking, they all boil down to hack. So that's the idea behind stemming, 15:22.040 --> 15:28.200 reducing the complexity.
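A toy sketch of steps one and three as just described, sentence segmentation plus stemming; the hand-rolled suffix stripping below only stands in for the Porter-style stemmer an NLP library would provide, but it shows how hacker, hackers, and hacking all collapse to hack.

```python
import re

def split_sentences(text: str) -> list[str]:
    # Step 1 (toy): break text into sentences on terminal punctuation.
    return [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

def stem(word: str) -> str:
    # Step 3 (toy): strip common suffixes so "hacker", "hackers",
    # "hacking" all reduce to "hack". A real Porter-style stemmer
    # handles far more cases; this only shows the idea.
    for suffix in ("ers", "er", "ing", "ed", "s"):
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

text = ("The three buffer overflow vulnerabilities could allow remote code "
        "execution. Hackers were hacking the exposed service.")
for sentence in split_sentences(text):
    print([stem(w.lower()) for w in re.findall(r"[a-z]+", sentence, re.I)])
```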
Now step four is where you take each sentence and 15:28.200 --> 15:33.920 calculate the similarity score by comparing the sentence against all the 15:33.920 --> 15:40.360 patterns that we had already generated in step two, right? So you 15:40.360 --> 15:44.000 take a sentence, compare it against all the patterns in courses of action. Take a 15:44.000 --> 15:49.720 sentence, compare it against all the patterns within TTPs, IOCs, and so on. 15:49.720 --> 15:54.240 Once you went through that entire process, we'd find the best match in 15:54.240 --> 16:01.000 terms of the similarity score, the pattern set, or the patterns, 16:01.000 --> 16:07.440 that actually matched against that sentence with the highest score. And then 16:07.440 --> 16:14.160 we repeated steps four and five for each STIX data construct. So that was 16:14.160 --> 16:18.960 the basic idea behind this entire process. 16:20.400 --> 16:26.160 So for the pattern generation itself, this is a very crucial 16:26.160 --> 16:30.760 component of information extraction. Garbage in, garbage out. Very true 16:30.760 --> 16:34.920 for information extraction. So it was important that we started out with 16:34.920 --> 16:39.080 at least the right base patterns. We used a couple of different approaches 16:39.080 --> 16:46.360 for this. The first one was a supervised approach. We, actually Elvis, 16:46.360 --> 16:51.320 manually annotated the pattern list that needed to be learned. This is essentially 16:51.320 --> 16:58.440 a classification challenge. So you have a corpus of documents, manually annotate 16:58.440 --> 17:04.360 them, feed them through the classifier, and let the classifier learn, right? So 17:04.360 --> 17:08.280 we call it a supervised approach. And then we also used a semi-supervised 17:08.280 --> 17:14.280 approach. This was used to boost the number of patterns, the base patterns, 17:14.280 --> 17:20.200 that we could generate. The supervised approach is fairly tedious. It does 17:20.200 --> 17:27.320 require a lot of time and effort. So we wanted to boost the actual pattern list. 17:27.320 --> 17:32.360 And we used the semi-supervised approach. The idea here is you create a set of seed 17:32.360 --> 17:37.080 patterns and you feed it documents that are not annotated. And because they're not 17:37.080 --> 17:41.360 annotated, this goes through a bootstrapping process where it keeps learning 17:41.360 --> 17:47.320 over time. There is human intervention required. If you just allow it to learn 17:47.320 --> 17:51.000 on its own and it keeps learning garbage, then you'll end up with garbage 17:51.000 --> 17:58.280 at the end. So another part of our effort was to find the right kind of 17:58.280 --> 18:03.200 base patterns. And let me quickly go to the next slide so Elvis can then really 18:03.200 --> 18:09.000 show you what this looks like in real life. So we went through several 18:09.000 --> 18:16.120 iterations of training and testing. We used documents and advisories from MS-ISAC, 18:16.120 --> 18:23.080 FS-ISAC, and ICS-CERT, ran them through the information extraction process, matched 18:23.080 --> 18:27.160 them against all the learned patterns that we had created, right? So the patterns 18:27.160 --> 18:31.320 that we created in the previous step, matched them against those, and we 18:31.320 --> 18:36.200 calculated three different similarity scores: lexical, semantic, and contextual 18:36.200 --> 18:41.480 similarity. Now each of these has its own advantages and disadvantages.
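Before the breakdown of the three score types, here is a rough sketch of the step-four and step-five loop described above: score every sentence against each STIX construct's pattern set and keep the best match. The Jaccard token overlap and the pattern strings are assumptions used only to show the shape of the loop, not the project's actual similarity functions or patterns.

```python
def jaccard(a: set[str], b: set[str]) -> float:
    # Stand-in similarity: token overlap. The real system used lexical,
    # semantic, and contextual scores rather than plain Jaccard.
    return len(a & b) / len(a | b) if a | b else 0.0

# Illustrative pattern sets keyed by STIX construct (not the real patterns).
PATTERNS = {
    "course_of_action": ["apply the vendor patch", "disable the affected service"],
    "ttp": ["spear phishing email with malicious attachment"],
    "indicator": ["command and control server at ip address"],
}

def best_construct(sentence: str):
    tokens = set(sentence.lower().split())
    scored = []
    for construct, patterns in PATTERNS.items():
        best = max(jaccard(tokens, set(p.split())) for p in patterns)
        scored.append((best, construct))
    return max(scored)  # (similarity, construct) with the highest score

print(best_construct("Users should apply the vendor patch immediately"))
```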
18:41.480 --> 18:46.320 So the lexical similarity score works well when you're looking for a co-occurring 18:46.320 --> 18:56.600 set of words. So think Siemens and Symantec or any other ICS domain product. 18:56.600 --> 19:01.080 There is a certain set of co-occurring words that you can identify. The 19:01.080 --> 19:05.640 semantic similarity score tries to detect a concept through a word or a 19:05.640 --> 19:12.280 phrase, and then the contextual similarity score analyzes a certain window of words 19:12.280 --> 19:18.120 around a known pattern to try and understand the context. As Elvis goes to 19:18.120 --> 19:22.640 the demo, he'll talk about why we generated the three different scores and 19:22.640 --> 19:29.920 how they were useful in being able to map to the right STIX construct. So 19:29.920 --> 19:35.560 aggregate the score, look at the result, and if the result's good, then make sure 19:35.560 --> 19:39.800 that the pattern is in the pattern list. If not, add it to it. If the score is bad, 19:39.800 --> 19:44.400 try and understand what went wrong, fine-tuning the parameters. This is your 19:44.400 --> 19:49.800 traditional training and testing process that you use for machine learning. So 19:49.800 --> 19:53.640 enough slides. Just want to make sure that everyone's awake. Threat intelligence. 19:53.640 --> 20:01.880 Any Shmooballs? No. All right. So, Elvis. 20:04.000 --> 20:13.920 Can I drop the mic when I'm done? Hi everyone. My name is Elvis Hovor. I have 20:13.920 --> 20:18.240 been with Accenture Technology Labs about three years. I started out of grad 20:18.240 --> 20:23.440 school. I lead some of the development work in threat intelligence down there. 20:23.440 --> 20:52.640 Okay, so let's start the demo if I can find it. 20:53.440 --> 21:01.720 I don't have anything higher than 1440. That's the best I have. 21:01.720 --> 21:25.120 Anyway, is it showing any better now? What can I do? Help here. Reduce it. Zoom out. 21:25.120 --> 21:32.160 Okay. That didn't help. 21:39.040 --> 21:51.280 That good? Thank you. Okay. So I think like Shimon said, initially we started with 21:51.280 --> 21:55.800 trying to think about, you know, how to help analysts really prioritize the 21:55.800 --> 22:01.560 advisories they get on a daily basis. I'm sure that for a bunch of analysts here, you 22:01.560 --> 22:06.800 get documents on the regular that you need to read through every morning. Trying 22:06.800 --> 22:11.440 to figure out which one to read first is a task on its own. You know, which one 22:11.440 --> 22:15.860 has the most important information or relevant information to you is also 22:15.860 --> 22:19.400 another task that you have to go through. Wasting your time trying to figure out 22:19.400 --> 22:23.200 which document to read and all of that. Shimon went through a lot of that. What we 22:23.200 --> 22:27.800 wanted to do with our research project initially started with just trying 22:27.800 --> 22:33.400 to prioritize these documents. But as time went on we decided to wrap a UI 22:33.400 --> 22:38.760 around that to make it easier and more usable for anyone that wanted to use the 22:38.760 --> 22:42.800 tool. So we moved away from just the research portion of it to try and put a 22:42.800 --> 22:50.200 UI on top of it. What you see here is, you know, on a regular day you come in and 22:50.200 --> 22:54.760 load up bulk documents basically. It is placed on this page and it 22:54.760 --> 23:00.080 shows you scores on which one we think should, well, be prioritized the most.
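As a rough illustration of the lexical versus contextual scores described at the top of this section (the semantic score would additionally lean on something like WordNet), here are toy versions of each; the window size, cue words, and example sentence are invented purely for illustration.

```python
def lexical_score(sentence: str, cooccurring: set[str]) -> float:
    # Lexical (toy): what fraction of a known co-occurring word set shows up.
    tokens = set(sentence.lower().split())
    return len(tokens & cooccurring) / len(cooccurring)

def contextual_score(sentence: str, anchor: str, cues: set[str], window: int = 3) -> float:
    # Contextual (toy): look at a window of words around a known anchor
    # pattern and return 1.0 if any cue word appears there, else 0.0
    # (binary, like the contextual extraction described in the demo).
    tokens = sentence.lower().split()
    if anchor not in tokens:
        return 0.0
    i = tokens.index(anchor)
    nearby = set(tokens[max(0, i - window): i + window + 1])
    return 1.0 if nearby & cues else 0.0

s = "the energetic bear group targeted scada systems via a watering hole attack"
print(lexical_score(s, {"scada", "systems", "watering", "hole"}))   # 1.0
print(contextual_score(s, "group", {"bear", "panda", "team"}))      # 1.0
```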
23:00.080 --> 23:04.920 Once again this is a research idea; we give scores because we thought that 23:04.920 --> 23:08.400 those were a good way to score the documents that we have. But, you know, 23:08.400 --> 23:13.200 based on your own preferences you can score it differently and it will show 23:13.200 --> 23:17.000 differently. But you look here and then you see a set of documents and these 23:17.000 --> 23:20.820 are the ones that are prioritized for you to read first and second and third. It 23:20.820 --> 23:25.440 came in on different days. We have a set of data that spans maybe three days. We 23:25.440 --> 23:29.680 wish we had more but it spans about three or four days. That's about how much 23:29.680 --> 23:34.680 we have. I'll try and go through the process with you, loading up documents, how we are 23:34.680 --> 23:38.800 calculating the scores, and what kind of information we are getting out of it. 23:38.800 --> 23:44.440 The data elements we are able to extract out of each document. What extraction 23:44.440 --> 23:50.040 method works better for us and what we realized would work to extract certain 23:50.040 --> 23:54.520 STIX elements or constructs better than other ones. And so as I go through I'll 23:54.520 --> 24:02.920 explain that more. So come here and upload data. First I would upload a file 24:02.920 --> 24:09.720 that has, I guess, the patterns we trained according to that file structure. So it 24:09.720 --> 24:16.120 extracts a lot better if you've trained your pattern set against a specific 24:16.120 --> 24:21.040 document and have trained your modules over time to be able to extract that 24:21.040 --> 24:27.160 data more. So you would see that one works very well and the other works 24:27.160 --> 24:33.120 somewhat okay. So let me try and upload. 24:40.280 --> 24:44.000 Okay this is a file I can upload. 24:44.000 --> 24:58.040 Oh and we were trying to, we're trying to make it a batch file because it takes, 24:58.040 --> 25:01.920 sometimes it takes just a little bit of time to run a single file. So we're 25:01.920 --> 25:05.120 thinking if someone has multiple files that they want to run you want to be 25:05.120 --> 25:08.000 able to make sure that you can make it a batch job for the first one to process 25:08.000 --> 25:12.240 and the second one to come in. But for some reason Java wouldn't allow it, well we 25:12.240 --> 25:17.400 haven't figured out how to have the computer run the Java command on its own 25:17.400 --> 25:21.440 because it's kind of like sandboxed and won't allow us unless we enter the 25:21.440 --> 25:24.920 command. So I have to enter it here, but that gives you an opportunity to see 25:24.920 --> 25:29.640 what's going on in the background. How, you know, it's comparing to the 25:29.640 --> 25:33.440 patterns like Shimon showed earlier on. 25:33.440 --> 25:47.120 So that's the extraction that's going on right now. So if I get back to my page 25:47.120 --> 25:53.040 here and I go to my prioritized list, at the end of the list table you should be 25:53.040 --> 26:11.600 able to see the file that we just entered.
We shorted out a little bit, but 26:11.600 --> 26:16.560 what eventually happens is that when you run it through and the natural language 26:16.560 --> 26:19.240 processing module is running in the back end, it should tell you that it's being 26:19.240 --> 26:24.880 processed, and when it's done, when you refresh your page, you would see it shows 26:24.880 --> 26:27.640 an output that says that it's been processed, and so you can go ahead and 26:27.640 --> 26:30.560 start looking at the file. So if you have a bunch of files this is going to be a 26:30.560 --> 26:39.080 batch process that keeps going through and taking the files through. I have to 26:39.080 --> 26:44.320 shift this a little bit just so I can get to that. So that's it. The file is 26:44.320 --> 26:48.720 uploaded, it's running right now. I'm sure it's done; if I highlight it it should 26:48.720 --> 26:55.640 move. Please pay attention to the file name, I think it's the ICS-CERT one. 27:18.720 --> 27:28.000 Okay yeah thank you. 27:32.120 --> 27:39.640 So now, that means that it found enough information in there to 27:39.640 --> 27:44.480 score it and take it to third in your stack of documents that you have. 27:44.480 --> 27:48.600 I'll tell you a little bit about the scoring as we go through and why the 27:48.600 --> 27:53.160 scoring is that way. It's not perfect yet, but as time goes on I think we would 27:53.160 --> 27:56.320 tweak the scoring a lot more to get some of the information that we want out of 27:56.320 --> 28:00.560 it. So the document will show on one side and the elements 28:00.560 --> 28:04.480 extracted will be on the right hand side. So you see your COAs, all the things 28:04.480 --> 28:13.380 that it was able to extract out of the document you would see from here. So for 28:13.380 --> 28:19.640 COAs we can see that it extracted a bunch of things. The score you see here is 28:19.640 --> 28:25.880 1.0. 1.0 because anything that comes out of contextual similarity extraction, if 28:25.880 --> 28:31.120 we use that method, gives us a binary number, it's either 0 or 1. Either it was able to 28:31.120 --> 28:34.960 extract something or it wasn't. Semantic and lexical are a little different in 28:34.960 --> 28:39.320 that the scores are actually graded and you can tweak them very well. So that is one 28:39.320 --> 28:43.240 part that kind of skews our score a little bit, but what ends up happening is 28:43.240 --> 28:47.600 that even if you don't find anything for the others, or if you only find contextual, which is 28:47.600 --> 28:53.200 one, we decided to average it out for a document. So if the lexical and semantic 28:53.200 --> 28:56.840 similarity methods didn't find anything you would still average it out by three 28:56.840 --> 29:08.040 and it ends up reducing the value of the score. So we were able to find some COAs. 29:08.040 --> 29:14.240 Because of the patterns that we have, sometimes it can extract 29:14.240 --> 29:19.120 certain information that is not necessarily accurate, or it stops extracting before 29:19.120 --> 29:24.320 it's supposed to, instead of extracting the full sentence. Because of this we wanted to 29:24.320 --> 29:27.840 build in, like Shimon said earlier on, a learning module to make sure that 29:27.840 --> 29:34.080 we are correcting this automatically, or you know, the module in the back end 29:34.080 --> 29:39.080 is learning to update itself and delete the patterns that don't work and all of 29:39.080 --> 29:43.640 that.
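A worked example of the averaging just described, assuming the document score is a plain mean of the three method scores: a binary 1.0 from contextual extraction averaged against two empty results still only yields about 0.33, which is the skew being pointed out.

```python
def document_score(lexical: float, semantic: float, contextual: float) -> float:
    # Assumed scoring: a plain average of the three similarity methods.
    return (lexical + semantic + contextual) / 3

# Contextual found a match (binary 1.0) but lexical and semantic found nothing,
# so averaging by three drags the document score down, as described above.
print(document_score(lexical=0.0, semantic=0.0, contextual=1.0))  # 0.333...
```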
So we put the Facebook-style thumbs down there just to, you know, delete 29:43.640 --> 29:49.360 anything that we believe is wrong. So now it's taking all of the how-to's out, 29:49.360 --> 29:54.760 because it was extracting how-to's, and what we built inside that again is the 29:54.760 --> 29:59.580 ability to delete this pattern, whatever pattern is pulling out the how-to, if it 29:59.580 --> 30:03.000 reaches a certain threshold, if it gets a certain number of 30:03.000 --> 30:07.320 thumbs down. So that way the module is learning, and as time goes on we want 30:07.320 --> 30:13.040 to be able to build functionality for analysts to be able to, you know, 30:13.040 --> 30:18.160 highlight a text or highlight a pattern that they think the natural language 30:18.160 --> 30:21.560 processing module is missing and then it's going to add on to the set of 30:21.560 --> 30:28.640 documents that we have. You can see here TTPs and IOCs in here, and they pick up 30:28.640 --> 30:33.520 pretty good ones if the document is really known and the patterns were 30:33.520 --> 30:38.080 tested with that document. I would go in and pick a regular document from CERT 30:38.080 --> 30:42.800 and upload that and see what the difference is; you would see that some of 30:42.800 --> 30:46.760 them would not be as great as what we have with the documents that have 30:46.760 --> 31:01.200 been tested already. So what, first one? 31:16.760 --> 31:38.640 Goodness. Okay, TXT. Upload it again. 31:46.760 --> 32:02.640 Once again we should be able to see it being processed, and when we highlight it, 32:02.640 --> 32:07.120 when it's done, we should be able to get into the file and check that out. So 32:07.120 --> 32:10.680 whilst that is running, these are some 32:10.680 --> 32:15.480 of the findings that we saw as the research project went on. We were hoping 32:15.480 --> 32:20.880 that we could use one of the similarity methods, the lexical, contextual, or 32:20.880 --> 32:25.080 semantic similarity methods, to extract the information that we needed, but 32:25.080 --> 32:32.480 we realized that each one has its own advantage, so it would not be 32:32.480 --> 32:37.440 wise for us to just choose one and leave the others, because we saw 32:37.440 --> 32:42.120 that maybe contextual similarity was able to pull things like threat actors, 32:42.120 --> 32:45.400 the things that have very specific elements that need to be pulled out, like 32:45.400 --> 32:51.440 a word or, you know, a very small phrase, a phrase that isn't too long, and then for 32:51.440 --> 32:57.720 things like TTPs that sometimes can span multiple lines, we saw that using things 32:57.720 --> 33:00.840 like semantic or lexical similarity methods would pull those out better for 33:00.840 --> 33:05.080 us, so we ended up using an aggregate of all three of these to be able to get 33:05.080 --> 33:15.720 the information that we needed. So I believe it is done and we can check that out. 33:34.240 --> 33:38.320 That's it. 33:38.320 --> 33:54.760 So out of this document, it was able to pull a bunch of exploit targets for us. 33:54.760 --> 34:00.960 As you can see, sometimes it can grab the sentence that you want and not necessarily 34:00.960 --> 34:02.160 give you the element that you want. 34:02.160 --> 34:04.760 It might grab it before, it might grab it after.
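A minimal sketch of the thumbs-down learning loop being described: each downvote counts against the pattern that produced the extraction, and the pattern is retired once it crosses a threshold. The threshold value and the in-memory storage are assumptions for illustration only.

```python
from collections import Counter

class PatternFeedback:
    """Retire extraction patterns that analysts repeatedly thumb down."""

    def __init__(self, patterns: set[str], threshold: int = 3):
        self.patterns = set(patterns)
        self.threshold = threshold          # assumed value, not the project's
        self.downvotes = Counter()

    def thumbs_down(self, pattern: str) -> None:
        self.downvotes[pattern] += 1
        if self.downvotes[pattern] >= self.threshold:
            self.patterns.discard(pattern)  # stop using the bad pattern

feedback = PatternFeedback({"how to", "apply the patch"})
for _ in range(3):
    feedback.thumbs_down("how to")          # analysts reject the "how-to" noise
print(feedback.patterns)                    # {'apply the patch'}
```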
34:04.760 --> 34:11.520 It's just, I guess, the patterns and then having to get the right pattern match for 34:11.520 --> 34:12.520 each document. 34:12.520 --> 34:16.200 The thing with natural language processing, and I think it's natural language processing 34:16.200 --> 34:20.480 as a whole, it's still being developed to get to that fine-grained point where it can 34:20.480 --> 34:23.360 pull very specific information. 34:23.360 --> 34:29.400 And so as much as we were trying to get it, and this is also, I guess, a project that 34:29.400 --> 34:34.020 we are still building on and trying to get to the next step, we do have a few things 34:34.020 --> 34:38.040 that need to be tweaked here and there, but it is extracting some information. 34:38.040 --> 34:44.320 So if you, as an analyst, can get to move away from doing all of these things and having 34:44.320 --> 34:47.880 just a document moving up and telling you that I have this amount of information on 34:47.880 --> 34:51.600 there that is related to threat information, I think you should look at that and see if 34:51.600 --> 34:53.200 that applies to you. 34:53.200 --> 34:56.120 It's going to make your job a little easier. 34:56.120 --> 35:02.080 So after we were done with putting this information in a more structured format, which is the 35:02.080 --> 35:07.720 STIX that we have here, and I think there's a caveat to putting it in a structured format 35:07.720 --> 35:12.360 in STIX, and I'll put that in our challenges at the end, but we wanted to do the higher 35:12.360 --> 35:16.280 order analysis that I think Shimon spoke about early on. 35:16.280 --> 35:21.760 I mean, if we have the information in a more structured format, there's so much we can 35:21.760 --> 35:22.760 do with it. 35:22.760 --> 35:28.560 So we decided to take some examples and do some D3 visualizations on top of that, of 35:28.560 --> 35:32.160 the data that we have, to see what we can pull out and what we can do with it. 35:32.160 --> 35:34.320 Like we said, it's just examples. 35:34.320 --> 35:35.320 We had a graph database. 35:35.320 --> 35:39.080 We were pulling the information out and trying to see what we could relate together 35:39.080 --> 35:45.960 and the kind of information that we can pull out of it now that it's more structured. 35:45.960 --> 35:47.800 So we decided to use this tree map. 35:47.800 --> 35:55.440 This is one of the first ones that we have. 35:55.440 --> 36:01.920 To show the information that we have in there, and thanks to Sean, the D3 guy, he helped us a 36:01.920 --> 36:04.200 lot on this. 36:04.200 --> 36:11.000 So we decided that if we could use a tree map to see what information is contained in 36:11.000 --> 36:14.000 the documents, then you can have a general idea. 36:14.000 --> 36:19.080 If you limit the documents that you have to maybe a one-week set of documents, then 36:19.080 --> 36:29.520 you want to see what has shown up a lot in the advisories that you've collected. 36:29.520 --> 36:36.960 I think one thing that I've forgotten to say is that as analysts, or as the threat data 36:36.960 --> 36:43.520 team, or the threat analyst team, you collect documents that are usually related to your 36:43.520 --> 36:44.520 organization. 36:44.520 --> 36:47.720 So some of this information can get very specific to your organization because you're collecting 36:47.720 --> 36:53.440 data that's around the kind of operations that you guys run in your organizations.
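The tree map view described here boils down to counting how often each extracted entity appears in the documents from a chosen window; a toy aggregation like the following (dates and entities invented for illustration) is enough to feed a D3 treemap.

```python
from collections import Counter
from datetime import date, timedelta

# Invented example data: (document date, construct, extracted value).
extractions = [
    (date(2015, 1, 12), "threat_actor", "energetic bear"),
    (date(2015, 1, 13), "threat_actor", "energetic bear"),
    (date(2015, 1, 14), "threat_actor", "sandworm"),
    (date(2015, 1, 14), "coa", "apply the vendor patch"),
]

def counts_for_window(rows, construct, days=7, today=date(2015, 1, 15)):
    # Count how often each value of one construct appeared in the last N days,
    # e.g. which threat actors showed up a lot in this week's advisories.
    cutoff = today - timedelta(days=days)
    return Counter(v for d, c, v in rows if c == construct and d >= cutoff)

print(counts_for_window(extractions, "threat_actor"))
```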
36:53.440 --> 36:59.160 So you can come here and just look for maybe threat actors and see in the last week 36:59.160 --> 37:04.080 which threat actors have shown up a lot in the documents that I have, or in the last 37:04.080 --> 37:10.520 day, maybe last month, which one is coming up a lot and why should I be worried about 37:10.520 --> 37:11.520 that. 37:11.520 --> 37:16.640 And then you can do the same for COAs, go into your COAs and see what kind of COAs are 37:16.640 --> 37:17.640 out there. 37:17.640 --> 37:23.560 And the COAs are worded as sentences, or very long sentences, so trying to display that in 37:23.560 --> 37:28.560 D3 was a little bit of a hassle too. 37:28.560 --> 37:30.960 We're still trying to figure that out, like we said. 37:30.960 --> 37:40.080 So we want to be able to give the user the ability to see what the whole COA 37:40.080 --> 37:41.080 is. 37:41.080 --> 37:44.840 When I say COA here, I mean courses of action. 37:44.840 --> 37:45.840 I'm sorry. 37:45.840 --> 37:51.560 I'm using a lot of acronyms, am I not? 37:51.560 --> 37:53.560 Anyway. 37:53.560 --> 37:57.320 The other visualization we decided to use was historical analysis. 37:57.320 --> 38:05.960 Here we wanted to either give the analyst the ability to widen the net or tighten the 38:05.960 --> 38:06.960 net some more. 38:06.960 --> 38:10.760 You have documents, and I think one of the problems that we were talking about earlier 38:10.760 --> 38:16.280 on is that as an analyst, you, or as a human trying to analyze these documents, you 38:16.280 --> 38:21.960 may have seen some element in there, maybe a threat actor or a TTP that you were concerned 38:21.960 --> 38:26.320 about, but it was way back, let's say four weeks back or a month back, and you don't 38:26.320 --> 38:28.480 remember which document it was anymore. 38:28.480 --> 38:32.520 You never did anything about it, but all of a sudden you see that picking up again and 38:32.520 --> 38:36.120 you want to be able to find those documents that are related to that document that you 38:36.120 --> 38:38.280 just saw, but there's no way you can remember it. 38:38.280 --> 38:41.520 You don't even know sometimes if you ever came across that. 38:41.520 --> 38:47.400 We wanted to show that information here and make sure that we give you a historical view 38:47.400 --> 38:56.080 of the kind of documents that you have in your repository or your database. 38:56.080 --> 39:01.360 What you see here is a list of the documents that we have in here. 39:01.360 --> 39:04.280 It's a limited list that we have. 39:04.280 --> 39:08.920 Anytime you mouse over a document, let's say you just put a new document in, it shows up. 39:08.920 --> 39:14.760 Anytime you mouse over it, you see the documents that are connected to that document in any 39:14.760 --> 39:15.760 way. 39:15.760 --> 39:18.960 If it's a set of threat actors that are connected, it's going to show you a different color. 39:18.960 --> 39:22.320 If it's a set of IOCs that are connected, it's going to show you a different color. 39:22.320 --> 39:27.280 It just gives you an idea of how your documents are connected together and which ones you 39:27.280 --> 39:32.120 should be paying attention to if you ever want to go back and read or look at some of 39:32.120 --> 39:33.920 the information that you have in there. 39:33.920 --> 39:39.360 Like we said, this is all going into our Neo4j database. 39:39.360 --> 39:42.400 So Neo4j has a frontend query interface.
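To illustrate the kind of relationship lookup the graph store makes cheap, here is a standalone Python sketch rather than an actual Neo4j Cypher query, with an invented document and entity layout: find the documents connected to a given one through shared threat actors or IOCs.

```python
from collections import defaultdict

# Invented layout: which extracted entities appear in which documents.
doc_entities = {
    "advisory-001": {("threat_actor", "energetic bear"), ("ioc", "203.0.113.7")},
    "advisory-002": {("threat_actor", "energetic bear")},
    "advisory-003": {("ioc", "203.0.113.7"), ("ioc", "evil-example.com")},
    "advisory-004": {("coa", "apply the vendor patch")},
}

def connected_documents(doc_id: str):
    # Documents that share at least one extracted entity with doc_id,
    # grouped by the entity type that links them (actor, IOC, ...).
    links = defaultdict(set)
    for kind, value in doc_entities[doc_id]:
        for other, entities in doc_entities.items():
            if other != doc_id and (kind, value) in entities:
                links[kind].add(other)
    return dict(links)

print(connected_documents("advisory-001"))
# {'threat_actor': {'advisory-002'}, 'ioc': {'advisory-003'}}
```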
39:42.400 --> 39:46.320 You can go in there and query the data that you want and do it whichever way you want. 39:46.320 --> 39:47.320 This is D3. 39:47.320 --> 39:48.440 It's all open source. 39:48.440 --> 39:52.000 You can decide to use whichever visualization you want to use on top of this data to pull 39:52.000 --> 39:54.600 whatever you feel is most necessary for you. 39:54.600 --> 39:59.200 These were just examples we wanted to show. 39:59.200 --> 40:10.080 And so some of the challenges that I guess we had going into this, and we realized were 40:10.080 --> 40:13.560 going to be a hindrance for us, is that, you know, STIX. 40:13.560 --> 40:14.560 We wanted to put it in STIX. 40:14.560 --> 40:15.560 We said STIX. 40:15.560 --> 40:17.640 We wanted to make sure that everything is in STIX. 40:17.640 --> 40:19.000 It's more structured. 40:19.000 --> 40:21.520 But STIX is very expensive. 40:21.520 --> 40:27.040 And the data or the elements that you need to extract into STIX can get very, very, 40:27.040 --> 40:32.480 very granular, and you need to make sure that you capture that out of a sentence, or you 40:32.480 --> 40:36.720 capture that without missing the meaning of that in the sentence. 40:36.720 --> 40:38.000 And it became very difficult. 40:38.000 --> 40:41.840 We tried with natural language techniques to get granular and we saw that we were missing 40:41.840 --> 40:43.720 a lot of information that we needed. 40:43.720 --> 40:47.600 So we decided to expand it a little more and go for the broader STIX constructs, which 40:47.600 --> 40:53.120 is, you know, IOCs, threat actors, things of that sort. 40:53.120 --> 40:54.120 It's broader. 40:54.120 --> 40:55.680 That way we can capture a lot more. 40:55.680 --> 41:00.800 You might have to capture it in a sentence instead of capturing the element, but it still 41:00.800 --> 41:03.240 gives you a better idea of what's in there. 41:03.240 --> 41:07.600 And as time goes on and as this is developed and as people, you know, put a lot more effort 41:07.600 --> 41:12.160 into it, we believe that it's going to get to that point where the natural language processing 41:12.160 --> 41:17.440 engine will be able to go directly and look for things like affected systems and just 41:17.440 --> 41:21.720 pull that affected system out and put it in a very, very specific place that a machine 41:21.720 --> 41:22.720 can easily use. 41:22.720 --> 41:29.720 Do you have anything? 41:29.720 --> 41:34.720 Okay. 41:34.720 --> 41:53.480 Yeah, so that was one of them, and also I think this is one of the most basic NLP problems 41:53.480 --> 42:00.000 around, having one word that means something else in a different sentence. 42:00.000 --> 42:01.400 So it's difficult. 42:01.400 --> 42:07.680 As humans, we are easily able to kind of make those out, but for a computer, it's a lot 42:07.680 --> 42:13.200 more difficult for the computer to make out that, you know, maybe, what? 42:13.200 --> 42:17.240 Give me an example. 42:17.240 --> 42:19.320 Maybe attacker, right? 42:19.320 --> 42:23.160 In one sentence means that that's the person that actually committed a crime. 42:23.160 --> 42:26.120 It's a threat actor. 42:26.120 --> 42:31.240 And its simple form, attacker's simple form, is actually attack, right?
42:31.240 --> 42:35.240 And you know, for the computer, it sees it, it tries to break it down into all of its 42:35.240 --> 42:40.640 simplest forms for natural language processing, and all of a sudden that attack just means 42:40.640 --> 42:44.320 it's trying to explain something else in the middle of a sentence somewhere in a TTP. 42:44.320 --> 42:48.000 It ends up pulling the same attack just like it's trying to pull attacker. 42:48.000 --> 42:50.360 So things like that made it very difficult. 42:50.360 --> 42:55.120 These are underlying NLP problems, and I think that as we go forward and we put a little 42:55.120 --> 43:00.640 more effort into it as a community, it's going to improve and this is going to help in some 43:00.640 --> 43:03.160 of the work that we're trying to do. 43:03.160 --> 43:04.160 All right. 43:04.160 --> 43:09.080 Let me just take it home. 43:09.080 --> 43:12.960 So yeah, in terms of what are we trying to do moving forward, right? 43:12.960 --> 43:17.760 Like we talked a little bit about some of the challenges that we face, some of the results 43:17.760 --> 43:19.960 that work, some of the patterns that work. 43:19.960 --> 43:24.320 One of the first things that we want to do is extend the functionality of tagging 43:24.320 --> 43:26.360 new patterns from the screens you saw, right? 43:26.360 --> 43:32.680 So that really puts the power of the tool in the hands of the analyst. 43:32.680 --> 43:35.560 There's a lot of fine-tuning to be done with the weighting parameters. 43:35.560 --> 43:41.080 As we go forward, one of the things that we found when we were doing the analysis is that 43:41.080 --> 43:47.280 threat actors, and I was talking about threat actors and TTPs, were getting really accurate 43:47.280 --> 43:52.800 results using contextual similarity, but then there are highly specific elements within indicators 43:52.800 --> 43:58.720 of compromise, within observables, that are better suited for lexical and semantic analysis. 43:58.720 --> 44:04.280 So we are looking at those components as we go through the iterative process of making 44:04.280 --> 44:09.200 this solution and the underlying engine better. 44:09.200 --> 44:11.600 Expand the pattern sets for different types of documents. 44:11.600 --> 44:17.320 Right now we are heavily focused on US-CERT, ICS-CERT, and MS-ISAC. 44:17.320 --> 44:22.520 Those documents tend to have a lot more vulnerabilities, courses of action. 44:22.520 --> 44:25.440 Sometimes they might have threat actor and campaign names. 44:25.440 --> 44:26.520 That's unusual. 44:26.520 --> 44:34.640 So we want to expand the actual corpus to include that spectrum of information. 44:34.640 --> 44:38.440 And one of the next things that the natural language processing team is really going to 44:38.440 --> 44:43.120 focus on is being able to incorporate named entity recognition. 44:43.120 --> 44:48.440 So being able to identify references to known names and entities in the text themselves. 44:48.440 --> 44:55.040 This can be really useful if people do follow a common standard of naming campaigns and 44:55.040 --> 44:56.040 threat actors. 44:56.040 --> 45:01.800 These can be used for named entity recognition techniques. 45:01.800 --> 45:08.840 So everything that we've used here is based off open source technology. 45:08.840 --> 45:14.520 And if you do want to tinker, and I believe quite a lot of you here are hobbyists, if 45:14.520 --> 45:18.600 you want to build your own stack, where do you go about putting this together?
45:18.600 --> 45:19.880 There aren't a lot of NLP people here. 45:19.880 --> 45:26.800 We had to go about trying to rely on the knowledge of our team to really understand how you 45:26.800 --> 45:28.480 get to put this whole stack together. 45:28.480 --> 45:32.320 So Apache OpenNLP, definitely look that up. 45:32.320 --> 45:38.120 Excellent resource of both implemented algorithms and documentation. 45:38.120 --> 45:43.600 They have all the underlying technologies that you need for parsing the sentences, applying 45:43.600 --> 45:49.840 stemming, doing any kind of part-of-speech tagging, any kind of basic parsing. 45:49.840 --> 45:55.520 They have a lot of great implemented algorithms already out there. 45:55.520 --> 46:00.960 The Stanford Natural Language Processing Group, again, a great wealth of information 46:00.960 --> 46:01.960 over there. 46:01.960 --> 46:09.760 They also have released a lot of their algorithms on their website. 46:09.760 --> 46:14.800 And WordNet, it's essentially a lexical database of English. 46:14.800 --> 46:20.480 So whenever you're trying to do similarity scores between words that could have the same 46:20.480 --> 46:25.960 meaning or could be related, WordNet is the world's largest lexical database. 46:25.960 --> 46:30.960 So definitely that's a core component of what we are using. 46:30.960 --> 46:36.960 And then the TextRank algorithm, again, this is a core part of how we are doing the lexical 46:36.960 --> 46:41.840 similarity matching, or how we intend on implementing lexical similarity matching. 46:41.840 --> 46:43.800 All of this is open source. 46:43.800 --> 46:47.680 It's available out there for you to go download. 46:47.680 --> 46:52.020 And one of the things that we want to do over the next, in the near future, is we'll be 46:52.020 --> 46:59.760 setting up, we're looking to set up, a GitHub repository and at least start a flow of information 46:59.760 --> 47:05.560 of our learnings into the open community. 47:05.560 --> 47:09.680 So yeah, with that, I think we still have a few minutes for questions. 47:09.680 --> 47:13.320 I'm sure you guys have a few. 47:13.320 --> 47:15.520 And thanks for your time. 47:15.520 --> 47:20.760 And if you do want to follow us, you can follow me on Twitter. 47:20.760 --> 47:21.760 Hit me up with any questions. 47:21.760 --> 47:27.160 If you want to take anything offline, we'll be happy to start a conversation. 47:27.160 --> 47:36.000 But with that, with whatever little time that we have left, we'd be happy to take any questions. 47:36.000 --> 47:58.240 Thank you.
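For anyone who wants to poke at the WordNet piece of that stack, here is a small sketch using NLTK's WordNet interface (it assumes `pip install nltk` plus a one-time `nltk.download('wordnet')`), reusing the "buffer" example from earlier in the talk.

```python
# Requires: pip install nltk, then a one-time nltk.download("wordnet").
from nltk.corpus import wordnet as wn

# "buffer" has several senses; list them the way an analyst (or a semantic
# similarity score) has to disambiguate them.
for synset in wn.synsets("buffer", pos=wn.NOUN):
    print(synset.name(), "-", synset.definition())

# A crude relatedness check of the kind a semantic similarity score relies on:
# take the best path similarity between any sense of "buffer" and "storage".
storage_senses = wn.synsets("storage", pos=wn.NOUN)
best = max(
    b.path_similarity(s) or 0.0
    for b in wn.synsets("buffer", pos=wn.NOUN)
    for s in storage_senses
)
print(best)
```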